(57) A system and method are disclosed that provide transparent, global access to devices on a computer cluster. The present system generates unique device type (dev_t) values for all devices and corresponding links between a global file system and the dev_t values. The file system is modified to take advantage of this framework so that, when a user requests that a particular device, identified by its logical name, be opened, an operating system kernel queries the file system to determine that device's dev_t value and then queries a device configuration system (DCS) for the location (node) and identification (local address) of a device with that dev_t value. Once it has received the device's location and identification, the kernel issues an open request to the host node for the device identified by the DCS. File system components executing on the host node, which include a special file system (SpecFS), handle the open request by returning to the kernel a handle to a special file object that is associated with the desired device. The kernel then returns to the requesting user a file descriptor that is mapped to the handle, through which the user can access the device.



[FIG. 6: block diagram of a representative node 202 and the node 204. Node 202: memory 230 holding OS routines/objects 240 (kernel 242; PXFS 244 with optional PXFS clients 246, PXFS servers 248, t_objs 250, vnodes 252, snodes 254 and px vnodes 256; SpecFS 258; DDI framework 270 with attach 272 and link generators 274; DSOs 290 comprising DSO_enum 292, DSO_nodespec 294, DSO_global 296 and DSO_nodebound 298; device drivers 280) and data structures 300 (DevInfo tree 302, ddi_minor_nodes 306). Node 204: memory, OS routines/objects including the DCS with map_minor and the DSOs 290 with map_minor, and data structures including the DCS_database (reference numerals 330, 340, 360, 362, 364, 370, 372). Minor node fields: global minor number 308, local minor number 310, dev. class 312 (dev_enumerate 314, dev_nodespecific 316, dev_global 318, dev_nodebound 320).]






Printed by Jouve, 75001 PARIS (FR) 



EP 0 889 400 A1 

Description 

The present invention relates generally to systems and methods for accessing physical devices attached to a 
computer and, particularly, to systems and methods for accessing physical devices on a computer cluster. 


BACKGROUND OF THE INVENTION 

It has become increasingly common for Unix-based computer applications to be hosted on a cluster that includes a plurality of computers. It is a goal of cluster operating systems to render operation of the cluster as transparent to applications/users as if it were a single computer. For example, a cluster typically provides a global file system that enables a user to view and access all conventional files on the cluster no matter where the files are hosted. This transparency does not, however, extend to device access on a cluster.

Typically, device access on Unix-based systems is provided through a special file system (e.g., SpecFS) that treats devices as files. This special file system operates only on a single node. That is, it only allows a user of a particular node to view and access devices on that node, which runs counter to the goal of global device visibility on a cluster. These limitations are due to the lack of coordination between the special file systems running on the various nodes, as well as the lack of a device naming strategy that accommodates global visibility of devices. These aspects of a prior art device access system are now described with reference to FIGS. 1-4.

Referring to FIG. 1, there is shown a block diagram of a conventional computer system 100 that includes a central processing unit (CPU) 102, a high speed memory 104, a plurality of physical devices 106 and a group of physical device interfaces 108 (e.g., busses or other electronic interfaces) that enable the CPU 102 to control and exchange data with the memory 104 and the physical devices 106. The memory 104 can be a random access memory (RAM) or a cache memory.

The physical devices 106 can include but are not limited to high availability devices 112, printers 114, kernel memory 116, communications devices 118 and storage devices 120 (e.g., disk drives). Printers 114 and storage devices 120 are well-known. High availability devices 112 include devices such as storage units or printers that have associated secondary devices. Such devices are highly available because the secondary devices can fill in for their respective primary devices upon the primary's failure. The kernel memory 116 is a programmed region of the memory 104 that is used for accumulating and reporting system performance statistics. The communications devices 118 include modems, ISDN interface cards, network interface cards and other types of communication devices. The devices 106 can also include pseudo devices 122, which are software devices not associated with an actual physical device.

The memory 104 of the computer 100 can store an operating system 130, application programs 150 and data structures 160. The operating system 130 executes in the CPU 102 as long as the computer 100 is operational and provides system services for the processor 102 and the applications 150 being executed in the CPU 102. The operating system 130, which is modeled on v. 2.6 of the Solaris™ operating system employed on Sun® workstations, includes a kernel 132, a file system 134, device drivers 140 and a device driver interface (DDI) framework 142. Solaris and Sun are trademarks and registered trademarks, respectively, of Sun Microsystems, Inc. The kernel 132 handles system calls from the applications 150, such as requests to access the memory 104, the file system 134 or the devices 106. The file system 134 and its relationship to the devices 106 and the device drivers 140 are described with reference to FIGS. 2A and 2B.

Referring to FIG. 2A, there is shown a high-level representation of the file system 134 employed by v. 2.6 and previous versions of the Solaris operating system. In Solaris, the file system 134 is the medium by which all files, devices 106 and network interfaces (assuming the computer 100 is networked) are accessed. These three different types of accesses are provided respectively by three components of the file system 134: a Unix file system 138u (UFS), a special file system 138s (SpecFS) and a network file system 138n (NFS).

In Solaris, an application 150 initially accesses a file, device or network interface (all referred to herein as a target) by issuing an open request for the target to the file system 134 via the kernel 132. The file system 134 then relays the request to the UFS 138u, SpecFS 138s or NFS 138n, as appropriate. If the target is successfully opened, the UFS, SpecFS or NFS returns to the file system 134 a vnode object 136 that is mapped to the requested file, device or network node. The file system 134 then maps the vnode object 136 to a file descriptor 174, which is returned to the application 150 via the kernel 132. The requesting application subsequently uses the file descriptor 174 to access the corresponding file, device or network node associated with the returned vnode object 136.

The vnode objects 136 provide a generic set of file system services in accordance with a vnode/VFS interface or layer (VFS) 172 that serves as the interface between the kernel 132 and the file system 134. Solaris also provides inode, snode and rnode objects 136i, 136s, 136r that inherit from the vnode objects 136 and also include methods and data structures customized for the types of targets associated with the UFS, SpecFS and NFS, respectively. These classes 136i, 136s and 136r form the low-level interfaces between the vnodes 136 and their respective targets. Thus, when the UFS, SpecFS or NFS returns a vnode object, that object is associated with a corresponding inode, snode or rnode that performs the actual target operations. Having discussed the general nature of the Solaris file system, the focus of the present discussion will now shift to the file-based device access methods employed by Solaris.
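The inheritance relationship described above can be illustrated with a small Python model. It is a sketch only. The class names echo the vnode/inode/snode terminology of the text, but the `read` method, the `FakeDriver` class and its behavior are hypothetical. They do not reproduce the Solaris VFS interface.

```python
from abc import ABC, abstractmethod


class Vnode(ABC):
    """Generic file-system object seen by the kernel through the VFS layer."""

    @abstractmethod
    def read(self, nbytes: int) -> bytes:
        """Return up to nbytes from the underlying target."""


class Inode(Vnode):
    """UFS-style target: an ordinary file, modeled here as an in-memory buffer."""

    def __init__(self, contents: bytes) -> None:
        self.contents = contents
        self.offset = 0

    def read(self, nbytes: int) -> bytes:
        chunk = self.contents[self.offset:self.offset + nbytes]
        self.offset += len(chunk)
        return chunk


class FakeDriver:
    """Hypothetical device driver that returns a fixed byte per minor number."""

    def read(self, minor: int, nbytes: int) -> bytes:
        return bytes([minor]) * nbytes


class Snode(Vnode):
    """SpecFS-style target: forwards each operation to the device's driver."""

    def __init__(self, driver: FakeDriver, minor: int) -> None:
        self.driver = driver
        self.minor = minor

    def read(self, nbytes: int) -> bytes:
        return self.driver.read(self.minor, nbytes)


# The caller uses the same Vnode interface regardless of the target type.
targets: list[Vnode] = [Inode(b"hello"), Snode(FakeDriver(), 2)]
print([t.read(3) for t in targets])  # [b'hel', b'\x02\x02\x02']
```

The kernel side never needs to know which subclass it holds. That is the point of routing every target through the generic vnode layer.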

Referring to FIG. 2B, Solaris applications 150 typically issue device access requests to the file system 134 (via the kernel 132) using the logical name 166 of the device they need opened. For example, an application 150 might request access to a SCSI device with the command: open(/dev/dsk/disk_logical_address).

The logical name, /dev/dsk/disk_logical_address, indicates that the device to be opened is a disk at a particular logical address. In Solaris, the logical address for a SCSI disk might be "c0t0d0sx", where "c0" represents SCSI controller 0, t0 represents target 0, d0 represents disk 0, and sx represents the xth slice for the particular disk (a SCSI disk drive can have as many as eight slices).
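The structure of a SCSI logical address can be made concrete with a short parser. The helper `parse_logical_address` is hypothetical and not part of Solaris. It assumes slices are numbered 0 to 7, consistent with the limit of eight slices stated above.

```python
import re

# Pattern for a SCSI disk logical address of the form c<ctrl>t<target>d<disk>s<slice>.
_LOGICAL_ADDR = re.compile(r"c(\d+)t(\d+)d(\d+)s(\d+)")


def parse_logical_address(addr: str) -> dict[str, int]:
    """Split a logical address such as "c0t0d0s3" into its numeric fields."""
    m = _LOGICAL_ADDR.fullmatch(addr)
    if m is None:
        raise ValueError(f"not a cXtXdXsX address: {addr!r}")
    controller, target, disk, slice_no = map(int, m.groups())
    if not 0 <= slice_no <= 7:
        # The text allows at most eight slices per SCSI disk drive (0-7).
        raise ValueError(f"slice {slice_no} out of range 0-7")
    return {"controller": controller, "target": target, "disk": disk, "slice": slice_no}


print(parse_logical_address("c0t0d0s3"))
# {'controller': 0, 'target': 0, 'disk': 0, 'slice': 3}
```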
The logical name is assigned by one of the link generators 144, which are user-space extensions of the DDI framework 142, and is based on information supplied by the device's driver 140 upon attachment of the device and on a corresponding physical name for the device generated by the DDI framework 142. When an instance of a particular device driver 140 is attached to the node 100, the DDI framework 142 calls the attach routine of that driver 140. The driver 140 then assigns a unique local identifier to each device that can be associated with that instance and calls the ddi_create_minor_nodes method 146 of the DDI framework 142 for each such device. Typically, the unique local identifier constitutes a minor name (e.g., "a") and a minor number (e.g., "2"). Each time it is called, the ddi_create_minor_nodes method 146 creates a leaf node in the DevInfo tree 162 that represents a given device. For example, because a SCSI drive (i.e., instance) can have up to eight slices (i.e., devices), the local SCSI driver 140 assigns unique local identifiers to each of the eight slices and calls the ddi_create_minor_nodes method 146 with the local identifiers up to eight times.
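The attach sequence just described can be sketched as follows. The names `sd_attach`, `create_minor_node` and the two dataclasses are hypothetical stand-ins. They model the bookkeeping only and do not reproduce the real DDI function signatures. Here one driver instance exports one minor node per slice, each with a minor name ("a", "b", ...) and a consecutive minor number.

```python
from dataclasses import dataclass, field


@dataclass
class MinorNode:
    """Leaf of the DevInfo tree: one device exported by a driver instance."""
    minor_name: str
    minor_number: int


@dataclass
class DriverInstance:
    """Leaf-parent of the DevInfo tree, e.g. the sd driver at address 0."""
    name: str
    minors: list[MinorNode] = field(default_factory=list)


def create_minor_node(instance: DriverInstance, minor_name: str, minor_number: int) -> None:
    """Stand-in for ddi_create_minor_nodes 146: adds one leaf per call."""
    instance.minors.append(MinorNode(minor_name, minor_number))


def sd_attach(instance: DriverInstance, first_minor: int, nslices: int = 8) -> None:
    """Hypothetical SCSI attach routine: one minor node per slice, named a, b, c, ..."""
    if not 1 <= nslices <= 8:
        raise ValueError("a SCSI drive has between one and eight slices")
    for s in range(nslices):
        create_minor_node(instance, chr(ord("a") + s), first_minor + s)


sd0 = DriverInstance("sd@0")
sd_attach(sd0, first_minor=0, nslices=4)
print([(m.minor_name, m.minor_number) for m in sd0.minors])
# [('a', 0), ('b', 1), ('c', 2), ('d', 3)]
```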

Also associated with each device 106 is a UFS file 170 that provides configuration information for the target device 106. The name of a particular UFS file 170i is the same as a physical name 168i derived from the physical location of the device on the computer. For example, a SCSI device might have the following physical name 168: /devices/iommu/sbus/esp1/sd@addr:minor_name, where addr is the address of the device driver sd and minor_name is the minor name of the device instance, which is assigned by the device driver sd. How physical names are derived is described below in reference to FIG. 3.

To enable it to open a target device given the target device's logical name, the file system 134 employs a logical name space data structure 164 that maps logical file names 166 to physical file names 168. The physical names of devices 106 are derived from the location of the device in a device information (DevInfo) tree 162 (shown in FIG. 1), which represents the hierarchy of device types, bus connections, controllers, drivers and devices associated with the computer system 100. Each file 170 identified by a physical name 168 includes in its attributes an identifier, or dev_t (short for device type), which is uniquely associated with the target device. This dev_t value is employed by the file system 134 to access the correct target device via the SpecFS 138s. It is now described with reference to FIG. 3 how dev_t values are assigned and how the DevInfo tree 162 is maintained by the DDI framework 142.
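The two-step resolution performed by the file system (logical name to physical name, then physical name to the dev_t stored in the special file's attributes) can be sketched with two dictionaries. Both tables are hypothetical stand-ins for the logical name space 164 and the special files 170. The (major, minor) tuple representation of dev_t is an assumption of this sketch.

```python
# Hypothetical stand-ins for the logical name space 164 (logical -> physical)
# and for the attributes of the special files 170 (physical -> dev_t).
LOGICAL_NAME_SPACE = {
    "/dev/dsk/c0t0d0s0": "/devices/iommu@addr0/sbus@addr1/esp1@addr3/sd@0:a",
    "/dev/dsk/c0t0d0s2": "/devices/iommu@addr0/sbus@addr1/esp1@addr3/sd@0:c",
}
SPECIAL_FILE_ATTRS = {
    "/devices/iommu@addr0/sbus@addr1/esp1@addr3/sd@0:a": {"dev_t": (32, 0)},
    "/devices/iommu@addr0/sbus@addr1/esp1@addr3/sd@0:c": {"dev_t": (32, 2)},
}


def resolve_dev_t(logical_name: str) -> tuple[int, int]:
    """Follow logical name -> physical name -> special-file attributes."""
    physical_name = LOGICAL_NAME_SPACE[logical_name]
    return SPECIAL_FILE_ATTRS[physical_name]["dev_t"]


print(resolve_dev_t("/dev/dsk/c0t0d0s2"))  # (32, 2)
```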

Referring to FIG. 3, there is shown an illustration of a hypothetical DevInfo tree 162 for the computer system 100. Each node of the DevInfo tree 162 corresponds to a physical component of the device system associated with the computer 100. Different levels correspond to different levels of the device hierarchy. Nodes that are directly connected to a higher node represent objects that are instances of the higher level object. The root node of the DevInfo tree is always the "/" node, under which the entire device hierarchy resides. The intermediate nodes (i.e., nodes other than the leaf and leaf-parent nodes) are referred to as nexus devices and correspond to intermediate structures, such as controllers, busses and ports. At the next-to-bottom level of the DevInfo tree are the device drivers, each of which can export, or manage, one or more devices. At the leaf level are the actual devices, each of which can export a number of device instances, depending on the device type. For example, a SCSI device can have up to seven instances.

The hypothetical DevInfo tree 162 shown in FIG. 3 represents a computer system 100 that includes an input/output (i/o) controller for memory mapped i/o devices (iommu) at a physical address addr0. The iommu manages the CPU's interactions with i/o devices connected to a system bus (sbus) at address addr1 and a high speed bus, such as a PCI bus, at address addr2. Two SCSI controllers (esp1 and esp2) at respective addresses addr3 and addr4 are coupled to the sbus along with an asynchronous transfer mode (ATM) controller at address addr5. The first SCSI controller esp1 is associated with a SCSI device driver (sd) at address 0 (represented as @0) that manages four SCSI device instances (dev0, dev1, dev2, dev3). Each of these device instances corresponds to a respective slice of a single physical device 106. The first SCSI controller esp1 is also associated with a SCSI device driver (sd) at address 1 that manages plural SCSI device instances (not shown) of another physical device 106.

Each type of device driver that can be employed with the computer system 100 is assigned a predetermined, unique major number. For example, the SCSI device driver sd is assigned the major number 32. Each device is associated with a minor number that is unique within the group of devices managed by a single device driver. For example, the devices dev0, dev1, dev2 and dev3 associated with the driver sd at address 0 have minor numbers 0, 1, 2 and 3 and minor names a, b, c, d, respectively. Similarly, the devices managed by the driver sd at address 1 would have minor numbers distinct from those associated with the devices dev0-dev3 (e.g., four such devices might have minor numbers 4-7). The minor numbers and names are assigned by the parent device driver 140 (FIG. 1) for each new device instance (recall that a SCSI instance might be a particular SCSI drive and a SCSI device a particular slice of that drive). This ensures that each device exported by a given device driver has a unique minor number and name. That is, a driver manages a minor number-name space.

Each minor number, when combined with the major number of its parent driver, forms a dev_t value that uniquely identifies each device. For example, the devices dev0, dev1, dev2 and dev3 managed by the driver sd at address 0 have respective dev_t values of (32,0), (32,1), (32,2) and (32,3). The SpecFS 138s maintains a mapping of dev_t values to their corresponding devices. As a result, all device open requests to the SpecFS identify the device to be opened using its unique dev_t value.
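The pairs written above as (major, minor) are normally packed into a single integer. The sketch below assumes a 32-bit layout with a 14-bit major field and an 18-bit minor field, which we believe matches the 32-bit Solaris convention. Other systems split the bits differently. The helper names are hypothetical and do not call the operating system's own makedev.

```python
NBITSMINOR = 18                      # assumed width of the minor-number field
NBITSMAJOR = 14                      # assumed width of the major-number field
MAXMIN = (1 << NBITSMINOR) - 1
MAXMAJ = (1 << NBITSMAJOR) - 1


def make_dev(major: int, minor: int) -> int:
    """Pack a major and a minor number into one dev_t integer."""
    if not (0 <= major <= MAXMAJ and 0 <= minor <= MAXMIN):
        raise ValueError("major or minor number out of range")
    return (major << NBITSMINOR) | minor


def dev_major(dev: int) -> int:
    return dev >> NBITSMINOR


def dev_minor(dev: int) -> int:
    return dev & MAXMIN


# dev0-dev3 under the sd driver (major 32) yield four distinct dev_t values.
devs = [make_dev(32, m) for m in range(4)]
print(devs)  # [8388608, 8388609, 8388610, 8388611]
```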

The DevInfo tree path to a device provides that device's physical name. For example, the physical name of the device dev0 is given by the string:

/devices/iommu@addr0/sbus@addr1/esp1@addr3/sd@0:a

where sd@0:a refers to the device managed by the sd driver at address 0 whose minor name is a, i.e., the device dev0. The physical name identifies the special file 170 (shown in FIG. 2B), corresponding to an snode, that holds all of the information necessary to access the corresponding device. Among other things, the attributes of each special file 170 hold the dev_t value associated with the corresponding device.
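Deriving a physical name amounts to walking from a driver node up the DevInfo tree and joining the name@address components. The sketch below is a minimal model under stated assumptions. `DevInfoNode` and `physical_name` are hypothetical. The root "/" node is represented implicitly by the leading "/devices" prefix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DevInfoNode:
    """One node of a hypothetical DevInfo tree: name@unit-address plus a parent link."""
    name: str
    addr: str
    parent: Optional["DevInfoNode"] = None


def physical_name(driver: DevInfoNode, minor_name: str) -> str:
    """Walk from a driver node up to the root and build its /devices path."""
    parts = []
    node: Optional[DevInfoNode] = driver
    while node is not None:
        parts.append(f"{node.name}@{node.addr}")
        node = node.parent
    return "/devices/" + "/".join(reversed(parts)) + ":" + minor_name


iommu = DevInfoNode("iommu", "addr0")
sbus = DevInfoNode("sbus", "addr1", iommu)
esp1 = DevInfoNode("esp1", "addr3", sbus)
sd0 = DevInfoNode("sd", "0", esp1)
print(physical_name(sd0, "a"))  # /devices/iommu@addr0/sbus@addr1/esp1@addr3/sd@0:a
```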

As mentioned above, a link generator 144 generates a device's logical name from the device's physical name according to a set of rules applicable to the devices managed by that link generator. For example, in the case of the device dev0 managed by the driver sd at address 0, a link generator for SCSI devices could generate the logical name /dev/dsk/c0t0d0s0. Here c0 refers to the controller esp1@addr3, t0 refers to the target id of the physical disk managed by the sd@0 driver, d0 refers to the sd@0 driver and s0 designates the slice with minor name a and minor number 0. The device dev0 associated with the sd@1 driver could be assigned the logical name /dev/dsk/c0t1d1s4 by the same link generator 144. Note that the two dev0 devices have logical names distinguished by differences in the target, disk and slice values. It is now described with reference to FIG. 4 how this infrastructure is presently employed in Solaris to enable an application to open a particular device residing on the computer 100.
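A link-generator rule consistent with the two examples above can be sketched as follows. The rule itself is an assumption read off the examples: the target and disk numbers follow the sd driver's address, and the slice number equals the device's minor number. The controller table `CONTROLLER_INDEX` and the function name are hypothetical.

```python
# Hypothetical controller numbering kept by the link generator: esp1@addr3 -> c0.
CONTROLLER_INDEX = {"esp1@addr3": 0}


def scsi_link_generator(physical_name: str, minor_number: int) -> str:
    """Derive a /dev/dsk logical name from a physical name, following the
    example rule in the text: t and d follow the sd driver address and the
    slice number equals the device's minor number."""
    path, _minor_name = physical_name.rsplit(":", 1)
    *_, controller, driver = path.split("/")          # e.g. "esp1@addr3", "sd@0"
    driver_addr = int(driver.split("@", 1)[1])
    c = CONTROLLER_INDEX[controller]
    return f"/dev/dsk/c{c}t{driver_addr}d{driver_addr}s{minor_number}"


print(scsi_link_generator("/devices/iommu@addr0/sbus@addr1/esp1@addr3/sd@0:a", 0))
# /dev/dsk/c0t0d0s0
```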

Referring to FIG. 4, there is shown a flow diagram of operations performed in the memory 104 of the computer 100 by various operating system components in the course of opening a device as requested by an application 150. The memory 104 is divided into a user space 104U in which the applications 150 execute and a kernel space 104K in which the operating system components execute. This diagram shows with a set of labeled arrows the order in which the operations occur and the devices that are the originators or targets of each operation. Where applicable, dashed lines indicate an object to which a reference is being passed. Alongside the representation of the memory 104, each operation associated with a labeled arrow is defined. The operations are defined as messages, or function calls, where the message name is followed by the data to be operated on or being returned by the receiving entity. For example, the message (4-1), "open(logical_name)," is the message issued by the application 150 asking the kernel 132 to open the device represented in the user space 104U by "logical_name". In this particular example, the application is seeking to open the device dev2.

After receiving the open message (4-1), the kernel 132 issues the message (4-2), "get_vnode(logical_name)," to the file system 134. This message asks the file system 134 to return the vnode of the device dev2, which the kernel 132 needs to complete the open operation. In response, the file system 134 converts the logical name 166 to the corresponding physical name 168 using the logical name space 164. The file system 134 then locates the file designated by the physical name and determines the dev_t value of the corresponding device from that file's attributes. Once it has acquired the dev_t value, the file system 134 issues the message (4-3), "get_vnode(dev_t)," to the SpecFS 138s. This message asks the SpecFS 138s to return a reference to a vnode linked to the device dev2. Upon receiving the message (4-3), the SpecFS 138s creates the requested vnode 136 and an snode 136s, which links the vnode 136 to the device dev2, and returns the reference to the vnode 136 (4-4) to the file system 134. The file system 134 then returns the vnode reference to the kernel (4-5).

Once it has the vnode reference, the kernel 132 issues a request (4-6) to the SpecFS 138s to open the device dev2 associated with the vnode 136. The SpecFS 138s attempts to satisfy this request by issuing an open command (4-7) to driver 2, which the SpecFS knows manages the device dev2. If driver 2 is able to open the device dev2, it returns an open_status message (4-8) indicating that the open operation was successful. Otherwise, driver 2 returns a failure indication in the same message (4-8). The SpecFS 138s then returns a similar status message (4-9) directly to the kernel 132. Assuming that "success" was returned in message (4-9), the kernel 132 returns to the application 150 a file descriptor that is a user space representation of the vnode 136 linked to the device dev2 (4-10). The application 150, once in possession of the file descriptor, can access the device dev2 via the kernel 132 and the file system 134 using file system operations. For example, the application 150 inputs data from the device dev2 by issuing read requests directed to the returned file descriptor. These file system commands are then transformed into actual device commands by the SpecFS 138s and the vnode and snode objects 136, 136s that manage the device dev2.
Consequently, Solaris enables users of a computer system 100 to access devices on that system 100 with relative ease. However, the methods employed by Solaris do not permit users to transparently access devices across computers, even when the different computers are configured as part of a cluster. That is, an application running on a first computer cannot, using Solaris, transparently open a device on a second computer.
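The single-computer open sequence of FIG. 4 can be modeled as a chain of method calls, with each call annotated by the message numbers it stands for. The sketch makes several simplifying assumptions. Every class here is hypothetical. A dict stands in for the vnode/snode pair. The file system's logical-to-physical-to-dev_t lookup is collapsed into a single table. A list index serves as the file descriptor.

```python
class Driver:
    """Hypothetical driver: opening succeeds only for minors it has attached."""

    def __init__(self, minors: set[int]) -> None:
        self.minors = minors

    def open(self, minor: int) -> bool:                    # (4-7) -> (4-8)
        return minor in self.minors


class SpecFS:
    def __init__(self, drivers: dict[int, Driver]) -> None:
        self.drivers = drivers                             # major number -> driver

    def get_vnode(self, dev_t: tuple[int, int]) -> dict:   # (4-3) -> (4-4)
        return {"dev_t": dev_t}                            # stands in for vnode + snode

    def open(self, vnode: dict) -> bool:                   # (4-6) -> (4-9)
        major, minor = vnode["dev_t"]
        return self.drivers[major].open(minor)


class FileSystem:
    def __init__(self, names: dict[str, tuple[int, int]], specfs: SpecFS) -> None:
        self.names = names                                 # logical name -> dev_t
        self.specfs = specfs

    def get_vnode(self, logical_name: str) -> dict:        # (4-2) -> (4-5)
        return self.specfs.get_vnode(self.names[logical_name])


class Kernel:
    def __init__(self, fs: FileSystem, specfs: SpecFS) -> None:
        self.fs, self.specfs = fs, specfs
        self.fd_table: list[dict] = []

    def open(self, logical_name: str) -> int:             # (4-1) -> (4-10)
        vnode = self.fs.get_vnode(logical_name)
        if not self.specfs.open(vnode):
            raise OSError(f"cannot open {logical_name}")
        self.fd_table.append(vnode)
        return len(self.fd_table) - 1                      # the file descriptor


specfs = SpecFS({32: Driver({0, 1, 2, 3})})
kernel = Kernel(FileSystem({"/dev/dsk/c0t0d0s2": (32, 2)}, specfs), specfs)
print(kernel.open("/dev/dsk/c0t0d0s2"))  # 0
```

Every lookup in this model is local to one machine, which is exactly the limitation the preceding paragraph describes.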

The reason that the current version of Solaris cannot provide transparent device access in the multi-computer situation has to do with the way the dev_t and minor numbers are currently assigned when devices are attached. Referring again to FIG. 3, each time a device is attached to the computer 100, the device's associated driver assigns that device a minor number that is unique within the set of devices controlled by that driver. Combined with the driver's major number, that minor number can therefore be mapped to a unique dev_t value for the computer 100. However, if the same devices and driver were provided on a second computer, the driver and devices would be assigned a similar, if not identical, set of major and minor numbers and dev_t values. For example, if both computers had a SCSI driver sd (major number 32) and four SCSI device instances managed by the SCSI driver sd, each driver sd would allocate the same set of minor numbers to its local set of SCSI devices (e.g., both sets would have minor numbers between 0 and 3). Consequently, keeping in mind that a device is accessed according to its dev_t value, if a first node application wanted to open a SCSI disk on the second node, that application would not be able to unambiguously identify the SCSI disk to the SpecFS on either computer system.
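The ambiguity can be shown in a few lines. The helper `local_dev_ts` is hypothetical. It models the prior-art rule that each node's driver numbers its own devices starting from zero, so two nodes with the same driver and device count produce identical dev_t sets.

```python
def local_dev_ts(major: int, n_devices: int) -> list[tuple[int, int]]:
    """Prior-art numbering: each node's driver numbers its own devices from 0."""
    return [(major, minor) for minor in range(n_devices)]


node1 = local_dev_ts(32, 4)      # four SCSI slices on the first computer
node2 = local_dev_ts(32, 4)      # four SCSI slices on the second computer
ambiguous = sorted(set(node1) & set(node2))
print(ambiguous)  # [(32, 0), (32, 1), (32, 2), (32, 3)]
```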

Therefore, there is a need for a file-based device access system that enables applications, wherever they are 
executing, to transparently access devices resident on any node of a computer cluster. 

SUMMARY OF THE INVENTION 


Particular and preferred aspects of the invention are set out in the accompanying independent and dependent 
claims. Features of the dependent claims may be combined with those of the independent claims as appropriate and 
in combinations other than those explicitly set out in the claims. 

In summary, the present invention is a system and method that provides transparent, global access to devices on a computer cluster.

An embodiment of the invention includes a common operating system kernel running on each of the nodes composing the cluster, a file system running on all of the nodes, a device driver interface (DDI) running on each of the nodes, a device configuration system (DCS) running on one of the nodes, a DCS database accessible to the DCS and a plurality of device drivers located on each of the nodes.

Each of the device drivers manages one type of physical device and is associated with a unique, predetermined major number. When a new device of a particular type is attached to a respective node, an attach message is issued to that node's DDI indicating configuration information of the device being attached. The DDI, using the configuration information, creates a physical name in the file system name space for the device and a logical name that is a symbolic link to the physical name. The logical name for the device can subsequently be used to access the device via the file system.

As part of creating the logical name the DDI issues a map request to the DCS to request a global minor (gmin) 
number for the attached device. The map request message includes, among other things, the major number and at 
least a subset of the configuration information. 
In response to the map request, the DCS is configured to: 


(a) determine the gmin number, 

(b) return the gmin number to the DDI, and 

(c) store the gmin number, the major number and the subset of the configuration information. 

The requesting DDI then forms the logical name and derives a dev_t value associated with the device using the returned gmin number and updates local device information so that the device's dev_t value is accessible from the file system.
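The map-request exchange between a DDI and the DCS, in the form without device classes, can be sketched as below. `DCS`, `map_minor` and `ddi_attach` are hypothetical Python stand-ins. The database layout is an assumption: it is keyed by (major, gmin), and the stored subset of configuration information is a small dict. Comments mark steps (a) to (c) above.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class DCS:
    """Hypothetical device configuration system without device classes."""
    database: dict = field(default_factory=dict)      # (major, gmin) -> config subset
    _gmins: itertools.count = field(default_factory=itertools.count)

    def map_minor(self, major: int, config: dict) -> int:
        gmin = next(self._gmins)                  # (a) determine a cluster-unique gmin
        self.database[(major, gmin)] = config     # (c) store gmin, major and config
        return gmin                               # (b) return the gmin to the DDI


def ddi_attach(dcs: DCS, node: int, major: int, local_minor: int, minor_name: str) -> tuple[int, int]:
    """Hypothetical DDI step: send a map request, then build dev_t from the gmin."""
    config = {"node": node, "local_minor": local_minor, "minor_name": minor_name}
    gmin = dcs.map_minor(major, config)
    return (major, gmin)                          # dev_t now unique cluster-wide


dcs = DCS()
print(ddi_attach(dcs, node=1, major=32, local_minor=0, minor_name="a"))  # (32, 0)
print(ddi_attach(dcs, node=2, major=32, local_minor=0, minor_name="a"))  # (32, 1)
```

The same local minor number 0 on two nodes now yields two different dev_t values, which removes the ambiguity of the prior-art scheme.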

By providing a unique dev_t value for all devices and a link between the file system and that dev_t value, the present invention provides a global framework that enables devices on different nodes to be globally accessible. The file system is modified to take advantage of this framework so that, when a user requests that a particular device, identified by its logical name, be opened, the kernel queries the file system to determine that device's dev_t value and then queries the DCS for the location and identification of a device with that dev_t value. Once it has received the device's location and identification, the kernel issues an open request to the host node for the device identified by the DCS. File system components executing on the host node, which include a special file system (SpecFS), handle the open request by returning to the kernel a handle to a special file object that is associated with the desired device. The kernel then returns to the requesting user a file descriptor that is mapped to the handle, through which the user can access the device.
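The cluster-wide open path just summarized can be sketched as three lookups followed by a remote open. The names are hypothetical: `HostNode`, `cluster_open`, the handle tuple, and the logical names and table contents. The sketch assumes the DCS database maps each dev_t to a (node, local minor) pair. The proxy file system that would carry the request between nodes is not modeled.

```python
class HostNode:
    """Hypothetical cluster node whose SpecFS/driver layer hands out handles."""

    def __init__(self, node_id: int, attached: set[tuple[int, int]]) -> None:
        self.node_id = node_id
        self.attached = attached                       # (major, local minor) pairs

    def open_local(self, major: int, local_minor: int) -> tuple:
        if (major, local_minor) not in self.attached:
            raise OSError("device not attached on this node")
        return ("special-file-handle", self.node_id, major, local_minor)


def cluster_open(logical_name: str, fs_names: dict, dcs_db: dict, nodes: dict, fd_table: list) -> int:
    major, gmin = fs_names[logical_name]               # file system: logical name -> dev_t
    node_id, local_minor = dcs_db[(major, gmin)]       # DCS: dev_t -> location, local id
    handle = nodes[node_id].open_local(major, local_minor)  # open request to host node
    fd_table.append(handle)                            # descriptor mapped to the handle
    return len(fd_table) - 1


nodes = {1: HostNode(1, {(32, 0)}), 2: HostNode(2, {(32, 0)})}
fs_names = {"/dev/dsk/c0t0d0s0": (32, 0), "/dev/dsk/c1t0d0s0": (32, 1)}  # hypothetical names
dcs_db = {(32, 0): (1, 0), (32, 1): (2, 0)}
fds: list = []
fd = cluster_open("/dev/dsk/c1t0d0s0", fs_names, dcs_db, nodes, fds)
print(fds[fd])  # ('special-file-handle', 2, 32, 0)
```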

In a preferred embodiment, the DCS, file system, user and device being requested can all be on different nodes. 
To function in this environment, the present invention includes a proxy file system, which enables the users of a cluster node to communicate transparently with file objects co-located with a requested device on another node.

The present invention can also include a set of device server objects (DSOs) on each node of the cluster, each of which manages a particular class of devices. The respective device classes capture the particularity with which a user's request to open a particular device must be satisfied by the transparent, global device access system, in general, and the DCS, in particular. In a preferred embodiment there are four device classes: dev_enumerate, dev_node_specific, dev_global and dev_nodebound.

The dev_enumerate class is associated with devices that can have multiple instances at a particular node that are enumerated by their associated driver when each device is attached (e.g., multiple SCSI disks). The dev_node_specific class is associated with devices of which there is only one instance per node (e.g., kernel memory) and which, as a result, are not enumerated by their drivers. The dev_global class is for those devices that can be accessed either locally or remotely using a driver that is resident on each node (e.g., modems and network interfaces). The dev_nodebound class is used for devices that can only be accessed using a driver on a particular node and, if that particular node becomes unavailable, then by a driver on another node (e.g., highly available devices).
When classes are employed, the device configuration information issued by the driver to the DDI preferably includes the device's class. If available, the DDI includes this class information in its map request to the DCS. Upon receiving a map request including class information, the DCS consults its local DSO for that class. That DSO then determines the gmin number that should be assigned to the device being attached. For example, the DSO for the dev_enumerate class assigns each dev_enumerate device a gmin number that is unique across the cluster because each enumerated device must be accessed at a specific node. In contrast, the DSO for the dev_global class assigns each global device the same gmin value because it is immaterial at which node such devices are accessed. As for the other classes, the DSO for the dev_node_specific class assigns each device of that class the same, non-null gmin value, and the DSO for the dev_nodebound class assigns each device of that class a gmin number that is unique across the cluster.

If the class information is not provided by a driver, the present invention treats the corresponding device as if it were of the dev_enumerate class or the dev_global class, depending on whether it is a physical device (dev_enumerate) or a pseudo device (dev_global).
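The per-class gmin policy, including the default applied when a driver supplies no class, can be sketched as follows. Several details are assumptions of this sketch. One branch per class stands in for a separate DSO. A "shared" gmin is keyed by (class, major, minor name). The counter starts at 1 so that no gmin is null (zero). The class name `ClassAwareDCS` and the modem and pseudo-device major numbers are hypothetical.

```python
import itertools

# Classes whose devices get a gmin unique across the cluster.
UNIQUE_GMIN_CLASSES = {"dev_enumerate", "dev_nodebound"}
# Classes whose devices share one (non-null) gmin wherever they are attached.
SHARED_GMIN_CLASSES = {"dev_global", "dev_node_specific"}


class ClassAwareDCS:
    def __init__(self) -> None:
        self._counter = itertools.count(1)          # start at 1 so no gmin is 0
        self._shared: dict[tuple[str, int, str], int] = {}
        self.database: dict[tuple[int, int], list[tuple[int, int, str]]] = {}

    def map_minor(self, major, minor_name, node, local_minor, dev_class=None, pseudo=False):
        if dev_class is None:                       # default when the driver gives no class
            dev_class = "dev_global" if pseudo else "dev_enumerate"
        if dev_class in UNIQUE_GMIN_CLASSES:
            gmin = next(self._counter)
        elif dev_class in SHARED_GMIN_CLASSES:
            key = (dev_class, major, minor_name)
            if key not in self._shared:
                self._shared[key] = next(self._counter)
            gmin = self._shared[key]
        else:
            raise ValueError(f"unknown device class {dev_class!r}")
        self.database.setdefault((major, gmin), []).append((node, local_minor, dev_class))
        return gmin


dcs = ClassAwareDCS()
disk_a = dcs.map_minor(32, "a", node=1, local_minor=0)          # physical, no class given
disk_b = dcs.map_minor(32, "a", node=2, local_minor=0)
modem_1 = dcs.map_minor(40, "modem", node=1, local_minor=0, dev_class="dev_global")
modem_2 = dcs.map_minor(40, "modem", node=2, local_minor=0, dev_class="dev_global")
print(disk_a != disk_b, modem_1 == modem_2)  # True True
```

Giving shared-class devices a single gmin lets any node's driver serve the request. Unique gmins pin enumerated and node-bound devices to the node recorded in the database.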

BRIEF DESCRIPTION OF THE DRAWINGS 

Exemplary embodiments of the invention are described hereinafter by way of example only with reference to the accompanying drawings, in which:

FIG. 1 is a block diagram of a prior art computer system showing components used to provide access to devices 
on a single computer; 


FIG. 2A is a block diagram showing the relationships in the prior art between applications, the operating system kernel, the file system and the devices;

FIG. 2B is a block diagram showing the relationships in the prior art between device logical names, physical names, the file system, device type identifiers (dev_t) and devices;

FIG. 3 is a diagram of an exemplary device information tree (DevInfo tree) consistent with those employed in the prior art;

FIG. 4 is a flow diagram of operations performed in the memory 104 of the prior art computer system 100 in the course of opening a device as requested by an application 150;

FIG. 5 is a block diagram of a computer cluster in which the present invention can be implemented; 

FIG. 6 is a block diagram of memory programs and data structures composing the present invention as implemented in representative nodes 202 and 204 of the cluster of FIG. 5;

FIG. 7A is a flow diagram that illustrates the operations by which the device driver interface (DDI) framework and the device configuration system (DCS) establish an appropriate dev_t value, logical name and physical name for a device being attached to the node 202;

FIG. 7B illustrates the relationship between the local minor name/number, physical name and logical name estab- 
lished by the present invention; and 
FIGS. 8A and 8B are flow diagrams that illustrate the steps performed by the present invention in response to a request from an application 150 executing on a node 202-1 to access (open) a device that resides on a node 202-3.

DESCRIPTION OF THE PREFERRED EMBODIMENT 

Referring to FIG. 5, there is shown a block diagram of a computer cluster 201 in which the present invention can be implemented. The cluster 201 includes a plurality of nodes 202 with associated devices 106 and applications 150. As in FIG. 1, the devices 106 can include high availability devices 112, printers 114, kernel memory 116, communication devices 118 and storage devices 120. For the purposes of the present discussion, a global file system 206, which maintains a single global file space for all files stored on the cluster 201, runs on one of the nodes 202. The global file system 206 supports at least two representations of the devices 106. The physical name space (PNS) representation 305 resides in kernel space and corresponds to the physical arrangement of the devices 106 on the respective nodes 202. The logical name space (LNS) representation 304 is a user space version of the physical name space 305; i.e., each entry in the logical name space 304 maps to a corresponding entry in the physical name space 305. The [...] 206 [...] the applications 150. The cluster 201 also includes a node 204 that hosts a device configuration system (DCS) 208 that is a key component of an embodiment of the invention.

In other embodiments there might be any number of global file systems 206, each with its own physical and logical name spaces. In such an environment a particular device is accessed through only one of the global file systems 206 and its associated physical and logical name spaces.

As described above in reference to FIGS. 1-4, the prior Solaris device access system allows transparent device access only within a single computer system. Certain aspects of the way in which the prior art generates the logical names that are mapped by the file system to the dev_t value of the device to be accessed are not compatible with extending the current device access system to a cluster. For example, assuming that the sets of devices 106-1, 106-2 each included four SCSI disk drives, the logical naming system presently employed would result in different drives on the different nodes having the same dev_t value. This would make it impossible for an application 150-1 to access transparently a specific one of the disk drives on the node 202-2. It is now described how an embodiment of the invention provides such transparent, global device access.
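The dev_t clash can be illustrated with a short Python sketch. It is a hypothetical model, not part of the described embodiment: it assumes that each node numbers its disks independently starting at 0 and that the disk driver's major number is 32 (the value Table 1 later uses for the SCSI driver); the node labels are illustrative. Pooling the resulting (major, minor) pairs for the whole cluster shows every pair being claimed by more than one node.

```python
from collections import defaultdict

SD_MAJOR = 32  # illustrative major number for the SCSI disk driver


def per_node_dev_ts(disks_per_node):
    """Return (node, (major, minor)) pairs when every node numbers its own disks from 0."""
    pairs = []
    for node, count in disks_per_node.items():
        for minor in range(count):  # numbering restarts on each node, as on a standalone system
            pairs.append((node, (SD_MAJOR, minor)))
    return pairs


def collisions(pairs):
    """Map each (major, minor) pair claimed by more than one node to the claiming nodes."""
    owners = defaultdict(list)
    for node, dev in pairs:
        owners[dev].append(node)
    return {dev: nodes for dev, nodes in owners.items() if len(nodes) > 1}


pairs = per_node_dev_ts({"node_202-1": 4, "node_202-2": 4})
for dev, nodes in sorted(collisions(pairs).items()):
    print(dev, nodes)
```

With four disks on each of two nodes, all four (major, minor) pairs are ambiguous, which is the situation the global minor numbers described below are meant to remove.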

Referring to FIG. 6, there are shown additional details of a representative one of the nodes 202 and of the node 204, which hosts the DCS 208. The file system 206 is not shown in this figure as it resides only on one particular node 202-2. Each node 202 includes a memory 230 in which operating system (OS) routines/objects 240 and data structures 300 are defined. The OS routines 240 include an operating system kernel 242, a proxy file system (PxFS) 244, a special file system (SpecFS) 258, a device driver framework (DDI) 270, a set of device server objects (DSOs) 290 and device drivers 280.

As described above, the kernel 242 handles system calls from the applications 150, such as requests to access the memory 230, the file system 206 or the devices 106. The kernel 242 differs from the kernel 132 (FIG. 1) as it has been modified by the present invention to support global device access. The proxy file system (PxFS) 244 is based on the Solaris PxFS file system but, like the kernel 242, is modified herein to support global device access. The PxFS 244 includes a collection of objects that enable an application 150-i in one node 202-i to interact seamlessly with the file system 206 across different nodes 202. The PxFS objects include PxFS clients 246, PxFS servers 248, f_objs 250, vnodes (virtual file nodes) 252, snodes (special file nodes) 254 and px_vnodes (proxy vnodes) 256. Each of these objects is labeled in FIG. 6 as optional (opt) as they are created as needed by the PxFS 244 in response to operations of the file system 206.
The DDI framework 270 (hereinafter referred to as the DDI) is also similar to the DDI framework 142 described in reference to the prior art (FIG. 1). However, the DDI framework 270 is modified in an embodiment of the invention to interact with the DCS 360 and to generate physical and logical names that are compatible with devices 106 that can be accessed on and from different nodes 202. The DDI 270 includes an attach method 272 that is called every time a new device is attached to the local node 202. In contrast to the prior attach method, the attach method 272 is configured to employ the services of the DCS 360 to create a globally consistent physical name for each and every attached device. The DDI framework 270 also includes a collection of link generators 274 that generate unique logical names from the corresponding physical names. There is a different type of link generator for each different type of device 106. Thus, the attach routine 272 and the link generators 274 respectively build the physical and logical name spaces that render the devices 106 globally visible at the kernel and user levels.

An embodiment of the invention includes a set of DSOs 290 on each node of the cluster 200, each of which manages a particular class 312 of devices 106. The device classes 312 are a new aspect of the invention that capture the particularity with which a user's request to open a particular device 106 must be satisfied by the transparent global device access system, generally, and by the DCS 360, in particular. In the preferred embodiment there are four device classes: dev_enumerate 314, dev_node_specific 316, dev_global 318 and dev_nodebound 320, and four corresponding DSOs: DSO_enum 292, DSO_nodespec 294, DSO_global 296 and DSO_nodebound 298.
The dev_enumerate class 314 is associated with devices 106 that can have multiple instances at a particular node 202 that are enumerated by their associated driver 280 when each device is attached (e.g., multiple storage devices 120). The dev_node_specific class 316 is associated with devices 106 of which there is only one instance per node (e.g., the kernel memory 116) and which, as a result, are not enumerated by their drivers 280. The dev_global class 318 is for those devices 106 that can be accessed either locally or remotely using a driver that is resident on each node (e.g., communication devices 118). The dev_nodebound class 320 is used for devices that can only be accessed using a driver on a particular node (e.g., HA devices 112).

The drivers 280 are similar to the prior art drivers (FIG. 1), except that they report additional configuration information including, when available, the device class 312 of each device being attached.
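The four classes can be captured in a small Python sketch. The enum and helper names are hypothetical; only the class names, their examples, the fallback for a missing class (claim 10) and the fact that dev_global and dev_nodespecific devices are opened on the requesting node (FIG. 8B) come from the text.

```python
from enum import Enum
from typing import Optional


class DevClass(Enum):
    """The four device classes 312 described above."""
    DEV_ENUMERATE = "dev_enumerate"        # several instances per node, enumerated by the driver
    DEV_NODESPECIFIC = "dev_nodespecific"  # one instance per node, e.g. kernel memory 116
    DEV_GLOBAL = "dev_global"              # driver resident on each node, e.g. communication devices 118
    DEV_NODEBOUND = "dev_nodebound"        # usable only through a driver on one node, e.g. HA devices 112


def class_from_driver(reported: Optional[str]) -> DevClass:
    """Drivers report a class only when available; a missing class is treated as dev_enumerate."""
    return DevClass(reported) if reported else DevClass.DEV_ENUMERATE


def served_locally(cls: DevClass) -> bool:
    """True when an open request can be satisfied on the requesting node itself."""
    return cls in (DevClass.DEV_GLOBAL, DevClass.DEV_NODESPECIFIC)


print(class_from_driver(None), served_locally(DevClass.DEV_GLOBAL))
```

Defaulting to dev_enumerate is the conservative choice: a device of that class is always tied to the node that enumerated it, so nothing is opened on the wrong node.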

The data structures 300 include a DevInfo tree 302 and a ddi_minor_nodes table 306. Like many of the OS routines 240, the data structures 300 are similar to those used in the prior art but embody important differences that are aspects of the present invention. In particular, the DevInfo tree 302 includes additional intermediate nodes required to locate devices of selected classes within the cluster. As a result of changes to the physical name space 305, which is represented by the DevInfo tree 302, the logical name space 304 is also different from the prior art logical name space. Finally, the ddi_minor_nodes table 306 includes additional fields as compared to the ddi_minor_nodes table employed by the prior art. For example, the present ddi_minor_nodes table includes global minor number, local minor number and (device) class fields 308, 310 and 312 (described above); the prior art ddi_minor_nodes table did not include either of the fields 308 or 312.

The node 204 includes a memory 330 in which are defined OS routines/objects 340 and data structures 370. The OS routines/objects 340 include the device configuration system (DCS) 360, a map_minor method 362 on the DCS and a set of DSOs 290 identical to those already described. The data structures 370 include a DCS database 372. The DCS 360, for which there is no analog in the prior art, serves at least two important functions. First, the DCS 360 works with the DDIs 270 to assign global minor numbers to newly attached devices that allow those devices to be globally and transparently accessible. Second, the DCS 360 resolves requests to open attached devices to the node and DSO 290 that can satisfy them (FIG. 8B). The DCS database 372 holds in persistent storage the information the DCS 360 needs for both functions.

Referring to FIG. 7A, there is shown a flow diagram of the operations by which the DDI framework 270 in a node 202 and the DCS 360 in the node 204 establish an appropriate dev_t value, logical name and physical name for a device 380 being attached to the node 202. Collectively, the DDI 270, the link generators 274, the DCS 360 and extensions thereof act as a device registrar for the cluster 200. Operations and messages are indicated in the same manner as in FIG. 4A. Before describing the operations represented in the flow diagram, the relationship between some of the name spaces managed by an embodiment of the invention is described in reference to FIG. 7B.

Referring to FIG. 7B, there is shown a conceptual diagram of the minor name/number space 307, physical name space 305 and logical name space 304 employed in an embodiment of the invention for an exemplary cluster including two nodes 202-1, 202-2. As is described below, each time a device is attached to a node 202, its driver assigns it a local minor number 307_num and name 307_name. The DDI 270 uses this information to obtain a global minor number from the DCS 360 and to form a globally unique physical name 305_name for the device 106. The physical name 305_name locates the device in the cluster's device hierarchy. The link generators 274 then map the physical name 305_name to a globally unique logical name 304_name. Note that the DDIs 270-1, 270-2 and the link generators 274-1, 274-2 jointly generate common global physical and logical name spaces 305, 304, respectively. In contrast, each driver generates a minor name/number space only for its local node. Thus, an embodiment maps local minor names/numbers to global physical and logical names that are a part of the file system 206. Consequently, an application 150 can refer to any device in the cluster by a name that is unique across all nodes.

When a device 380 is attached to a node 202, its driver 280 issues an attach message to the DDI 270 and then a create_minor_nodes message (7-1b) for each device associated with the just-attached instance. The create_minor_nodes message (7-1b) describes the configuration of the device 380, including a local minor number (minor_num) 382 and a minor_name 384 assigned by the appropriate device driver 280, and a device class 386 selected from one of the classes 312. For example, if the device were the third SCSI disk drive attached to the node, its minor_name might be 'a' (indicating that it is the first slice on that disk) and its class dev_enumerate.

In response to the create_minor_nodes message (7-1b), the DDI 270 updates the ddi_minor_nodes table 306 by setting the local_minor_num field 310 equal to the minor_num value 382 (7-2). The DDI 270 then issues a dc_map_minor message (7-3) to the DCS 360, asking the DCS 360 to return an appropriate global minor number 388 for the device 380. What is meant by "appropriate" depends on the device class: dev_enumerate and dev_nodebound devices require unique global minor numbers, while dev_global and dev_nodespecific devices do not. The dc_map_minor message (7-3) has three fields: (1) 'gminor', which holds the global minor number 388 generated by the DCS 360; (2) 'lminor', which holds the local minor number 382 generated by the device driver 280; and (3) 'class', which holds the device class 386 generated by the device driver 280. In response to the dc_map_minor message (7-3), the DCS 360 issues a similar ds_map_minor message (7-4) to the local DSO 290 for the class identified in the message (7-3).
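The three-field dc_map_minor message can be sketched as a Python dataclass. This is a hypothetical in-memory form: the field names follow the text (with 'class' renamed dev_class because class is a reserved word in Python), while the types and helper functions are assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Classes whose devices need a cluster-unique global minor number, per the text.
UNIQUE_GMIN_CLASSES = frozenset({"dev_enumerate", "dev_nodebound"})


@dataclass(frozen=True)
class DcMapMinor:
    """Hypothetical in-memory form of the dc_map_minor message (7-3)."""
    gminor: Optional[int]  # global minor number 388, filled in by the DCS 360
    lminor: int            # local minor number 382 assigned by the driver 280
    dev_class: str         # device class 386 reported by the driver 280


def needs_unique_gmin(msg: DcMapMinor) -> bool:
    """'Appropriate' depends on the class: only some classes need cluster-unique numbers."""
    return msg.dev_class in UNIQUE_GMIN_CLASSES


def answer(msg: DcMapMinor, gmin: int) -> DcMapMinor:
    """Return a copy of the request with the global minor number filled in."""
    return replace(msg, gminor=gmin)


request = DcMapMinor(gminor=None, lminor=24, dev_class="dev_enumerate")
reply = answer(request, 24)
print(needs_unique_gmin(request), reply)
```

The DDI sends the request with gminor unset; the reply carries the same local minor and class with the global number filled in.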

The DSO 290, among other things, determines the global minor (gmin) number 388 that should be assigned to the device being attached. How the gmin number is assigned depends on the class 386 of the device. For example, the DSO 292 for the dev_enumerate class 314 assigns each dev_enumerate device a gmin number 388 that is unique across the cluster, because each enumerated device must be accessed at a specific node. In contrast, the DSO 296 for the dev_global class 318 assigns each dev_global device the same gmin number, as it is immaterial at which node the device is accessed. As for the other classes, the DSO 294 for the dev_node_specific class 316 assigns each device of that class the same gmin number on every node, and the DSO 298 for the dev_nodebound class 320 assigns each device of that class a gmin number that is unique across the cluster.
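The class-dependent rules above can be sketched as a small allocator. Two points are assumptions, not statements of the text: dev_global and dev_nodespecific devices reuse their local minor number (which matches the tcp and kmem rows of Table 1), and a unique-class device keeps its local number when it is free but otherwise receives the smallest unused number for its driver. The text only requires uniqueness; the fallback policy is illustrative.

```python
from collections import defaultdict

UNIQUE = {"dev_enumerate", "dev_nodebound"}
SHARED = {"dev_global", "dev_nodespecific"}


class GminAllocator:
    """Sketch of the class-dependent gmin rules; the 'smallest unused' fallback is an assumption."""

    def __init__(self):
        # major number -> global minors already recorded (stand-in for the DCS database 372)
        self.used = defaultdict(set)

    def assign(self, major: int, lminor: int, dev_class: str) -> int:
        if dev_class in SHARED:
            gmin = lminor  # same number on every node, as for tcp (0) and kmem (12) in Table 1
        elif dev_class in UNIQUE:
            taken = self.used[major]
            gmin = lminor  # keep the driver's number when no other device of this driver holds it
            if gmin in taken:
                gmin = 0
                while gmin in taken:  # otherwise take the smallest number still free for this driver
                    gmin += 1
        else:
            raise ValueError(f"unknown device class: {dev_class}")
        self.used[major].add(gmin)
        return gmin


alloc = GminAllocator()
first = alloc.assign(32, 24, "dev_enumerate")   # first SCSI disk keeps 24
second = alloc.assign(32, 24, "dev_enumerate")  # a second disk with local minor 24 gets another number
print(first, second)
```

Uniqueness is tracked per major number because a dev_t combines the driver's major number with the global minor number.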

The DSOs 292-298 assign global minor numbers by first consulting the DCS database 372 to determine which global minor numbers have already been assigned. The DCS database 372 is held in persistent storage and includes, for all devices 106 in the cluster 200, fields for the major number 390, global minor number 388, internal (local) minor number 382 and device server id 392 (comprising the server class 386 and numerical value 394). The minor name, major number, global minor number and local minor number have already been described. The numerical value 394 identifies the node 202 that is the server for the device being attached. This information is optional for dev_global and dev_nodespecific devices, as the identity of a server for the first class is immaterial and, for the second class, is the same as the location of whatever node wishes to access the device. An example of the DCS database 372 is shown in Table 1.

TABLE 1

(The server class 386 and numerical value 394 together form the device server id 392.)

device (not a field)  major 390  global minor 388  internal minor 382  server class 386  numerical value 394
tcp                   42         0                 0                   dev_global        0
kmem                  13         12                12                  dev_node_spec     0
disk c2t0d0s0         32         24                24                  dev_enum          node id
kmem                  13         1                 12                  dev_enum          node 0 id
kmem                  13         2                 12                  dev_enum          node 1 id
kmem                  13         3                 12                  dev_enum          node 2 id
kmem                  13         4                 12                  dev_enum          node 3 id
HA devices            M          X1                X1                  dev_nodebound     id
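Table 1 can be modeled as a list of records keyed by the (major, global minor) pair that forms a dev_t. The sketch below is illustrative: the record layout is an assumption, the placeholder values "node id", "node 0 id" and so on are copied from the table rather than replaced with real host ids, and the symbolic HA row (M, X1) is omitted because it has no concrete numbers.

```python
from typing import NamedTuple, Optional


class DcsEntry(NamedTuple):
    """One row of Table 1 (field numbers follow the text)."""
    device: str           # descriptive label only; "not a field"
    major: int            # 390
    gmin: int             # global minor 388
    lminor: int           # internal (local) minor 382
    server_class: str     # 386
    value: Optional[str]  # numerical value 394


# Rows 1-7 of Table 1, with the table's own placeholders for node ids.
TABLE_1 = [
    DcsEntry("tcp", 42, 0, 0, "dev_global", "0"),
    DcsEntry("kmem", 13, 12, 12, "dev_node_spec", "0"),
    DcsEntry("disk c2t0d0s0", 32, 24, 24, "dev_enum", "node id"),
] + [DcsEntry("kmem", 13, n + 1, 12, "dev_enum", f"node {n} id") for n in range(4)]


def lookup(major: int, gmin: int) -> DcsEntry:
    """Return the single entry whose (major, global minor) pair forms the requested dev_t."""
    matches = [e for e in TABLE_1 if (e.major, e.gmin) == (major, gmin)]
    if len(matches) != 1:
        raise KeyError((major, gmin))
    return matches[0]


print(lookup(13, 3))
```

Although all five kmem rows share local minor 12, their (major, global minor) pairs are distinct, so each dev_t identifies exactly one row.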



The first line of Table 1 shows an entry for a tcp interface. A tcp interface is a dev_global device, as it can be accessed from every node in the cluster 200. Its major number 390 is 42. Note that its global and local minor numbers 388, 382 and its server numerical value 394 (i.e., node_id) are set to 0. This is because it is immaterial from what node the tcp interface is accessed. Consequently, there is only one tcp entry in the DCS database for the entire cluster 200. The second entry in Table 1 is for a kernel memory device. A kernel memory device is accessed locally; for this reason, it is of the dev_nodespecific class. The major number 13 is associated with the kmem device driver. The kmem device has a null numerical value 394, as kmem devices are not accessed at a particular server, and identical, non-null global and local minor numbers (12). This is the case because, for dev_nodespecific devices, the DCS 360 simply assigns a global minor number that is identical to the local minor number. In the present example, there is only one kmem entry of the dev_nodespecific variety in the DCS database 372, as there is no need to distinguish between the kmem devices located on the respective nodes 202.

The third entry is for a SCSI disk c2t0d0s0 whose SCSI driver has major number 32. The DCS 360 has assigned the SCSI device a global minor number 388 that is identical to its local minor number 382 (24), as there are no other SCSI devices represented in the DCS database 372. However, if another SCSI device c2t0d0s0 with the same local minor number were registered at a different node, the DCS 360 would assign it a different global minor number. To distinguish SCSI devices with the same local numbers, the DCS database 372 includes complete server information. In this case the numerical value 394 is set to the hostid of the server 202.
Entries four through seven are for four kernel memory devices that are registered as dev_enumerate devices. In the preferred embodiment, each time a dev_nodespecific device is registered, additional entries can be created in the DCS database 372 for all of the nodes 202 in the cluster, which allows a user to access a dev_nodespecific device on a node other than the local node. Consequently, assuming there are four nodes 202-1, 202-2, 202-3 and 202-4, the DCS 360 can register a kernel memory device of the dev_enumerate class for each of those nodes. As with other dev_enumerate devices, each kmem device is assigned a unique global minor number. The dev_nodespecific information would be used when a user issues a generic request to open a kernel memory device (e.g., open(/devices/kmem)). The dev_enumerate information would be used when a user issues a specific request to open a kernel memory device. For example, the request open(/devices/kmem0) allows a user to open the kmem device on node 0.

The final entry shows how a generic high availability (HA) device is represented in the DCS database 372. The major number 390, global minor number and local minor number are taken from the values M, X1 and X1 provided in the map_minor_nodes message. The numerical value 394 is set to the id of the device, which is bound to a particular node. This "id" is not a node id; rather, the id is created uniquely for the cluster 200 for each HA service.

Once the global minor number 388 is determined for the device 380, the DSO 290 updates the DCS database 372 with the new information (7-5) and returns the global minor number 388 to the DCS 360 (7-6). The DCS 360 then returns the global minor number 388 to the DDI 270 (7-7), which updates the ddi_minor_nodes table 306 (7-8), as well as the logical name space 304, the physical name space 305 and the DevInfo tree 302 (7-9). The DDI 270 updates the ddi_minor_nodes table 306 by writing therein the new global minor number 388. The update to the name spaces 304/305 is more complex and is now described.

First, the DDI 270 adds a new leaf node to the DevInfo tree 302, whose structure is changed from that previously described in reference to FIG. 3 to include, just below the "/devices" node, an additional level of "/hostid" nodes representing the cluster sites where dev_enumerate devices are attached. Note that each node 202 has its own DevInfo tree 302 that represents the devices on that node. However, as represented by the physical name space, the collection of DevInfo trees is merged into a single representation with the additional /hostid nodes (e.g., a typical physical name might start out with the string /devices/hostid/...). Each device is also associated at the leaf level with its global minor number 388, not its local minor number 382. Where relevant (i.e., for dev_enumerate devices), the dev_t value of each leaf node of the DevInfo tree 302 is derived from the corresponding device's global minor number 388 and its driver's major number 390. For example, the physical path to a SCSI disk on a node 202-x with a global minor number GN, minor name MN and driver sd@addry is represented in the present invention as: /devices/node_202-x/iommu@addr/... This is the physical name of the UFS file 170 (FIG. 2B) that includes configuration information for the given device including, in its attributes, the dev_t value derived from the major and global minor numbers.
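A dev_t derived from the major number and the global minor number can be sketched as a packed integer. The 32-bit minor field is an assumption made for illustration; the text only says the dev_t value is derived from the two numbers, not how they are laid out. The numbers 24 and 0 reuse the allocator example above and are likewise illustrative.

```python
MINOR_BITS = 32  # assumed width of the minor field; the text does not fix a dev_t layout
MINOR_MASK = (1 << MINOR_BITS) - 1


def make_dev(major: int, gmin: int) -> int:
    """Combine a driver's major number with a device's *global* minor number."""
    if not 0 <= gmin <= MINOR_MASK:
        raise ValueError("global minor number does not fit in the minor field")
    return (major << MINOR_BITS) | gmin


def split_dev(dev: int) -> tuple:
    """Recover (major, global minor) from a value built by make_dev."""
    return dev >> MINOR_BITS, dev & MINOR_MASK


# Two disks whose drivers both chose local minor 24 end up with different dev_t
# values because their global minor numbers differ.
disk_a = make_dev(32, 24)
disk_b = make_dev(32, 0)
print(disk_a != disk_b, split_dev(disk_a))
```

Because the global minor number, not the local one, goes into the dev_t, the collision shown earlier for independently numbered nodes cannot arise.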

The link generators 274 of the present invention derive a logical name for the device (and for the corresponding UFS file) from at least a portion of the DevInfo path and the minor name provided by the driver, modified in accordance with the global minor number returned by the DCS.

For example, assume that the node 202-1 has one SCSI disk with four slices originally assigned by its driver the minor names a-d and minor numbers 0-3, and that the node 202-2 has one SCSI disk with six slices assigned the minor names a-f and minor numbers 0-5. Assume that, when these devices are attached, the DCS 360 returns global minor numbers 0-3 for the first SCSI disk and global minor numbers 4-9 for the second SCSI disk. Using these global minor numbers, the DDIs 270 create physical names (described below) and the link generators 274 use the DDIs 270 to create logical names that map to the physical names as follows:



minor name from driver 280    logical name from link generators 274
a (node 202-1)                /dev/dsk/c0t0d0s0
b (node 202-1)                /dev/dsk/c0t0d0s1
c (node 202-1)                /dev/dsk/c0t0d0s2
d (node 202-1)                /dev/dsk/c0t0d0s3
a (node 202-2)                /dev/dsk/c1t0d0s0
b (node 202-2)                /dev/dsk/c1t0d0s1
...                           ...
f (node 202-2)                /dev/dsk/c1t0d0s5



The logical names assigned to the node 202-1 and 202-2 devices have different controller values (the cx part of the logical name string cxt0d0sy, where x and y are variables). This is because the logical names map to device physical names and, in a cluster, devices on different nodes are associated with different controllers. For example, the node 202-1 controller is represented as c0 and the node 202-2 controller as c1.
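The example can be reproduced with a short, hypothetical Python sketch of a link generator for these disks. It assumes that the controller number of each node is already known (c0 for node 202-1, c1 for node 202-2), that minor names a, b, c, ... denote slices 0, 1, 2, ..., and that target and disk are fixed at t0d0 as in the table. Global minor numbers are handed out sequentially only to reproduce the 0-3 and 4-9 values of the example; this is not the DCS policy itself.

```python
def slice_of(minor_name: str) -> int:
    """In the example, driver minor names a, b, c, ... stand for slices 0, 1, 2, ..."""
    return ord(minor_name) - ord("a")


def build_names(controllers, disks):
    """Return {(node, minor name): (global minor, logical name)} for the example disks."""
    names, next_gmin = {}, 0
    for node, minor_names in disks.items():
        c = controllers[node]  # each node's disk sits behind its own controller (c0, c1, ...)
        for mn in minor_names:
            names[(node, mn)] = (next_gmin, f"/dev/dsk/c{c}t0d0s{slice_of(mn)}")
            next_gmin += 1
    return names


names = build_names({"202-1": 0, "202-2": 1}, {"202-1": "abcd", "202-2": "abcdef"})
for key, (gmin, logical) in names.items():
    print(key, gmin, logical)
```

The controller number is what keeps slice a on node 202-1 (c0t0d0s0) apart from slice a on node 202-2 (c1t0d0s0), even though the drivers assigned both the same minor name and number.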

The DDIs 270 also generate physical names identifying files whose attributes contain the dev_t values for the corresponding devices. For the above example, the logical name space 304 and the logical name space to physical name space map are updated as follows (note that addr substitutes for any address):



logical name          physical name from DevInfo tree 302
/dev/dsk/c0t0d0s0     /devices/node_202-1/iommu@addr/sbus@addr/esp1@addr/sd@0:a
/dev/dsk/c0t0d0s1                                             /esp1@addr/sd@0:b
/dev/dsk/c0t0d0s2                                             /esp1@addr/sd@0:c
/dev/dsk/c0t0d0s3                                             /esp1@addr/sd@0:d
/dev/dsk/c1t0d0s0     /devices/node_202-2/iommu@addr/sbus@addr/esp1@addr/sd@0:[illegible]
/dev/dsk/c1t0d0s1                                             /esp1@addr/sd@0:e
/dev/dsk/c1t0d0s2                                             /esp1@addr/sd@0:f
...                   ...
/dev/dsk/c1t0d0s5                                             /esp1@addr/sd@0:i



The example just presented shows how the DDIs 270 generate logical and physical names for dev_enumerate devices, of which SCSI disks are members. Briefly summarized, the rules for naming dev_enumerate devices require that each instance enumerated by a particular driver (e.g., sd) have a unique global minor number which, when combined with its driver's major number, forms a corresponding, unique dev_t value. These rules also specify that the physical name associated with each instance must include the hostid of that instance and the instance's global minor number, in addition to other traditional physical path information. The rules for naming devices from the other classes are similar to those described above for the dev_enumerate class.
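Two of the dev_enumerate naming rules lend themselves to a mechanical check, sketched below in Python. The tuple shape of each instance is hypothetical, and the sketch checks only (1) that major number plus global minor number yields a unique dev_t and (2) that the hostid appears as a component of the physical name; it does not check how the global minor number is encoded in the name, since the text does not fix that encoding.

```python
def naming_violations(instances):
    """Check two dev_enumerate naming rules.

    `instances` is a list of (major, gmin, hostid, physical_name) tuples (a hypothetical shape).
    Returns human-readable problems; an empty list means both checks passed."""
    problems, seen = [], {}
    for major, gmin, hostid, physical in instances:
        dev_t = (major, gmin)
        if dev_t in seen:  # rule: major + global minor must give a unique dev_t
            problems.append(f"dev_t {dev_t} shared by {seen[dev_t]} and {physical}")
        else:
            seen[dev_t] = physical
        if hostid not in physical.split("/"):  # rule: the hostid must appear in the physical path
            problems.append(f"{physical} does not name host {hostid}")
    return problems


ok = [
    (32, 0, "node_202-1", "/devices/node_202-1/iommu@addr/sbus@addr/esp1@addr/sd@0:a"),
    (32, 4, "node_202-2", "/devices/node_202-2/iommu@addr/sbus@addr/esp1@addr/sd@0:e"),
]
print(naming_violations(ok))
```

Such a check could be run over a DCS database snapshot to confirm that registration never produced an ambiguous dev_t.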

In particular, the DDI 270 assigns a dev_nodespecific device a logical name of the form /dev/device_name and a physical name of the form:

/devices/pseudo/driver@gmin:device_name,

where device_name is the name 384, pseudo indicates that devices of this type are pseudo devices, driver is the id of the corresponding driver and @gmin:device_name indicates the global minor number 388 and device name 384 of the dev_nodespecific device. For example, the logical and physical names of a kernel memory device could be /dev/kmem and /devices/pseudo/mm@12:kmem, respectively. As mentioned above, a kmem device can also be given a logical name that enables it to be accessed on a specific node. For example, the DDI 270 can map the logical name /dev/kmem0 to the physical name /devices/hostid0/pseudo/mm@0:kmem.

For the dev_global class, each logical name generated by the DDI identifies a common physical path that will be resolved to any device in the cluster 200 by the file system. Logical names for these devices are of the form /dev/device_name and are mapped to physical names of the form:

/devices/pseudo/clone@gmin:device_name,

where device_name, which refers to the driver, is the name 384, pseudo indicates that devices of this type are pseudo devices, clone indicates that the device is cloneable and @gmin:device_name indicates the global minor number 388 and device name 384 of the dev_global device. For example, the tcp device from Table 1 might have a logical name of /dev/tcp and a physical name of /devices/pseudo/clone@0:tcp. Note that the embodiment does not allow any of the dev_global devices to be made distinguishable, as in the case of the kmem devices described above. That is, all dev_global devices are indistinguishable.
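The naming forms for the dev_nodespecific and dev_global classes can be expressed as small formatting functions. The function names are hypothetical; the path templates and the kmem and tcp examples are taken directly from the text.

```python
def nodespecific_names(device_name, driver, gmin):
    """(logical, physical) names for a dev_nodespecific device, e.g. kmem."""
    return f"/dev/{device_name}", f"/devices/pseudo/{driver}@{gmin}:{device_name}"


def per_node_names(device_name, driver, node_index, gmin):
    """Optional node-qualified alias such as /dev/kmem0; the hostid<N> spelling copies the text's example."""
    return (f"/dev/{device_name}{node_index}",
            f"/devices/hostid{node_index}/pseudo/{driver}@{gmin}:{device_name}")


def global_names(device_name, gmin):
    """(logical, physical) names for a dev_global device; the physical name always uses the clone driver."""
    return f"/dev/{device_name}", f"/devices/pseudo/clone@{gmin}:{device_name}"


print(nodespecific_names("kmem", "mm", 12))
print(per_node_names("kmem", "mm", 0, 0))
print(global_names("tcp", 0))
```

Note that global_names has no node parameter at all, which mirrors the statement that dev_global devices cannot be made distinguishable.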

An advantage of the class-based naming system of an embodiment of the invention is that it is compatible with legacy software designed for prior versions of Solaris. For example, a legacy program might issue an open(/dev/kmem) request, in which case a version of Solaris embodying the present invention returns a handle to the local kmem device. Similar results are provided for dev_global and dev_enumerate devices. There was no concept of device classes in the prior art. The manner in which an embodiment of the invention responds to an open request for a device on another node is now described in reference to FIGS. 8A and 8B.

Referring to FIGS. 8A and 8B, there are shown flow diagrams of the steps performed by an embodiment of the invention in response to a request (8-1) from an application 150 executing on a node 202-1 to access (open) a device 106-2 that resides on a node 202-3. In this example, the file system 206 and the DCS 360 reside on the nodes 202-2 and 204, respectively. The application 150 issues the open request, on the device's logical name, to the kernel 242 of its node. The kernel 242 then queries the file system 206 to determine the device's dev_t value. Because the file system is on a different node from the kernel 242, this is a multistep process that involves the use of a proxy file system (PxFS), many aspects of which are already defined by current versions of Solaris. However, an embodiment of the invention modifies such proxy file system elements as the PxFS clients 246 and PxFS servers 248 to support interactions with the DCS 360, for which there is no analog in prior versions of Solaris. The interactions between the PxFS client 246, the PxFS server 248 and the file system 206 are now briefly described.

An object such as the kernel 242 that needs to access the file system 206 first issues the access request to its local PxFS client 246. The PxFS client holds a reference to the PxFS server 248 co-located with the file system 206. This reference enables the PxFS client 246 to communicate the kernel's request to the file system 206 via the PxFS server 248. The file system 206 performs the requested access, creates a vnode object 252 representing the requested file and returns a reference to the vnode object 252 to the PxFS server 248. Because the nodes 202-1 and 202-2 are different address spaces, the reference to the vnode 252 is useless to the PxFS client 246 and kernel 242 in the node 202-1. Consequently, the PxFS server 248 creates an f_obj 250 linked to the vnode 252 and returns a reference to the f_obj 250 to the PxFS client 246. Upon receiving the f_obj reference, the PxFS client 246 creates a proxy vnode (px_vnode) 256 that is linked to the f_obj 250. The kernel 242 can then access the file information represented by the vnode 252 by simply accessing the local px_vnode 256.
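The chain of vnode, f_obj and px_vnode can be sketched with plain Python objects. This is a hypothetical, single-process model: the class names mirror the text, but the cross-node references and remote calls of a real PxFS are not modeled, and the lookup step that maps a logical name to a physical name is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Vnode:
    """File-system-side object; a reference to it is only meaningful on its own node."""
    path: str
    attrs: dict = field(default_factory=dict)


@dataclass
class FObj:
    """Object exported by the PxFS server so that other nodes can reach a vnode."""
    vnode: Vnode


@dataclass
class PxVnode:
    """Client-side proxy; every access is forwarded through the linked f_obj."""
    fobj: FObj

    def attr(self, name):
        return self.fobj.vnode.attrs.get(name)


class PxfsServer:
    def __init__(self, files):
        self.files = files  # path -> Vnode; stands in for the global file system 206

    def lookup(self, path):
        return FObj(self.files[path])  # wrap the vnode instead of handing out a raw reference


class PxfsClient:
    def __init__(self, server):
        self.server = server  # reference to the server co-located with the file system

    def lookup(self, path):
        return PxVnode(self.server.lookup(path))


server = PxfsServer({"/p": Vnode("/p", {"dev_t": (32, 4)})})
pxv = PxfsClient(server).lookup("/p")
print(pxv.attr("dev_t"))
```

The kernel only ever holds the px_vnode; each attribute read travels px_vnode, f_obj, vnode, which is the indirection that makes the remote vnode usable.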

Using this mechanism, the kernel 242 issues a lookup message (8-2) on the logical name of the device to be opened to the PxFS client 246, which relays a similar lookup message (8-3) to the PxFS server 248. The PxFS server 248 issues to the file system 206 a lookup(logical_name), get_vnode message (8-4), which asks the file system 206 to map the logical name to the corresponding physical_name via a logical symbolic link and to return a reference to a vnode 252 representing the UFS file identified by that physical_name. When the physical name refers to a device, as in the present example, the attributes of the device include the unique dev_t of the device. As described above, the file system 206 then returns the vnode to the PxFS server 248 (8-5), and the PxFS server 248 creates a corresponding f_obj 250 and returns the f_obj 250 reference to the PxFS client 246 (8-6). The PxFS client 246 then creates a px_vnode 256 whose attributes include the dev_t information for the requested device and passes the px_vnode 256 reference to the kernel 242 (8-7). At this point, the kernel 242 issues an open message (8-8) to the PxFS client 246 for the px_vnode 256. Upon receiving this message, the PxFS client 246 determines from the px_vnode's attributes, which include a dev_t value, that the corresponding vnode 252 represents a device and that the open message must therefore be handled by the DCS 360. If the px_vnode 256 did not contain a dev_t value, the PxFS client 246 would satisfy the open request (8-8) through other channels. In prior versions of Solaris, the PxFS client does not perform any testing for dev_t values, as devices are only locally accessible.
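The decision the modified PxFS client makes on an open message can be reduced to a small dispatch function. This is an illustrative sketch: the two callables are hypothetical stand-ins for the DCS path (starting with the resolve message 8-9) and for the ordinary file-open path.

```python
from typing import Callable, Optional, Tuple

DevT = Tuple[int, int]  # (major, global minor)


def pxfs_open(dev_t: Optional[DevT],
              open_via_dcs: Callable[[DevT], str],
              open_regular: Callable[[], str]) -> str:
    """Route an open message (8-8) as the modified PxFS client does.

    A dev_t attribute marks the px_vnode as a device, so the request takes the DCS path;
    anything else takes the ordinary file path."""
    if dev_t is not None:
        return open_via_dcs(dev_t)
    return open_regular()


print(pxfs_open((32, 4), lambda d: f"device {d}", lambda: "file"))
print(pxfs_open(None, lambda d: "device", lambda: "file"))
```

Ordinary files therefore see no change in behavior; only vnodes that carry a dev_t are diverted to the DCS.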

Because the px_vnode 256 includes a dev_t value 430, the PxFS client 246 issues a resolve message (8-9) to the DCS 360 for the device corresponding to the dev_t. How the DCS 360 handles this request is now described in reference to FIG. 8B.

Referring to FIG. 8B, in response to the resolve(dev_t) message (8-9), the DCS 360 performs a lookup in the DCS database 372 to determine the location and identity of the device that corresponds to that dev_t value. Consistent with the preceding discussions of the device classes 312, devices of the dev_enumerate class are accessed on a particular node whose location is specified in the numerical value field 394 of the DCS database 372. In contrast, devices of the dev_global or dev_nodespecific classes are accessed on the local node of the requesting application. Once it has determined the location of the device to be opened, the DCS 360 returns (8-10) to the PxFS client 246 a reference (DSO_ref) to the DSO 290 that manages the device class to which the requested device belongs and is local to the node that hosts the requested device. In the present example, assuming that the requested device 106-2 is of the dev_enumerate class and is hosted on the node 202-3, the returned DSO_ref would be to the DSO_enum object 292 on the node 202-3.
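The DCS resolution step can be sketched as a lookup followed by a class-dependent choice of node. The database shape and node labels are hypothetical. The sketch deliberately refuses dev_nodebound entries, because their numerical value 394 holds an HA service id rather than a node id and this passage does not describe how that id is resolved.

```python
from typing import Dict, NamedTuple, Optional, Tuple

DevT = Tuple[int, int]  # (major, global minor)

DSO_FOR_CLASS = {
    "dev_enumerate": "DSO_enum",
    "dev_nodespecific": "DSO_nodespec",
    "dev_global": "DSO_global",
}


class Entry(NamedTuple):
    server_class: str
    node: Optional[str]  # numerical value 394; only meaningful for dev_enumerate here


def resolve(db: Dict[DevT, Entry], dev_t: DevT, requester: str) -> Tuple[str, str]:
    """Return (node, DSO) for the DSO_ref sent back in message 8-10."""
    entry = db[dev_t]
    if entry.server_class == "dev_enumerate":
        node = entry.node  # enumerated devices are opened on the node recorded in the database
    elif entry.server_class in ("dev_global", "dev_nodespecific"):
        node = requester   # these classes are opened on the requesting application's node
    else:
        raise NotImplementedError(f"resolution not sketched for {entry.server_class}")
    return node, DSO_FOR_CLASS[entry.server_class]


db = {(32, 24): Entry("dev_enumerate", "node_202-3"), (42, 0): Entry("dev_global", None)}
print(resolve(db, (32, 24), "node_202-1"))
print(resolve(db, (42, 0), "node_202-1"))
```

The same dev_t therefore always resolves to the same host for enumerated devices, while dev_global and dev_nodespecific requests stay on the caller's node.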

After receiving the message (8-10), the PxFS client 246 issues a get_device_fobj request for the device 106-2 to the referenced DSO 292 (8-11). In response, the DSO 292 issues a create_specvp() message (8-12) asking the SpecFS 410 on the node 202-3 to create an snode for the device 106-2. The DSO 292 then obtains the f_obj reference to the snode from the PxFS server 248-2, which returns the requested f_obj (8-14b). The DSO 292 then returns the f_obj reference to the snode to the PxFS client 246 (8-15). The client 246 then issues an open request (8-16) on this f_obj that goes to the SpecFS 410 via the PxFS server 248-2 (8-17).

The SpecFS 410 then attempts to open the device 106-2. Depending on the outcome of the open operation, the SpecFS 410 returns a status message (8-18) indicating either success or failure. If the open was successful, the status message (8-18) also includes a reference to the opened snode 432. Upon receiving "success" in the status message (8-18), the PxFS server 248-2 creates the f_obj 250-2 for the opened vnode 252-2 and returns it to the PxFS client 246 (8-19), which creates a px_vnode 256-2 that is linked across nodes to the f_obj 250-2. As the final step in the device open operation, the PxFS client returns the px_vnode 256-2 to the kernel 242 (8-20), which creates a corresponding user space file descriptor (fd) 434. The kernel 242 returns this file descriptor to the application 150-1 (8-21), which can then use the file descriptor 434 to interact directly (i.e., via the kernel 242, PxFS client 246 and px_vnode) with the device 106-2.
While the present invention has been described with reference to a few specific embodiments, the description is illustrative of the invention and is not to be construed as limiting the invention. Various modifications may occur to those skilled in the art without departing from the scope of the invention.



Claims 



1. A system configured to provide global access to physical devices located on a computer cluster comprising a plurality of nodes, the system comprising:

a global file system;

a device configuration system (DCS);

the global file system responding to a request to access one such physical device issued from one of the nodes by requesting a DSO handle from the DCS;

at least one device server object (DSO);

the DCS determining in response to the access request an identity of a first DSO associated with the requested physical device and returning to the global file system a reference to the first DSO; and

the global file system returning a file descriptor for subsequent use in accessing the requested physical device.



2. The system of claim 1, wherein the DCS is hosted on one of the nodes, the system further comprising:

a common operating system kernel running on each of the nodes in the computer cluster;

a device driver interface (DDI) running on each of the nodes; and

a plurality of device drivers located on each of the nodes, each of the device drivers being configured to manage one type of physical device and being associated with a unique, major number;

each device driver being configured, when a new device of an appropriate type is attached to a respective node, to issue an attach message to the DDI indicating a local identifier (locid) of the new device being attached;

the DDI being configured, in response to the attach message, to issue a map request to the DCS for a unique, global minor (gmin) number for the attached device, the map request indicating the major number and the locid of the device being attached;

the DCS being configured, in response to the map request, to (a) determine the gmin number, (b) return the gmin number to the DDI, and (c) store the gmin number, the major number and the locid;

the DDI being configured to associate the gmin number returned by the DCS and the major number with the attached device so that the attached device is accessible from the file system in response to a request to open the attached device.

3. The system of claim 1, wherein the DCS, file system and requested device are each on different nodes, the system further comprising a proxy file system enabling applications on one node to communicate transparently with file objects co-located with the requested device on another node.

4. The system of claim 1, wherein the at least one DSO comprises a set of device server objects on each node of the cluster, each of which manages a respective device class.

5. The system of claim 4, wherein the device class is a member of a set of device classes including at least one of:

"dev_enumerate," for designating devices with at least one occurrence managed by a particular driver, each of the occurrences managed by the particular driver on a particular node being individually enumerated;

"dev_nodespecific," for designating devices available on each node that are accessed locally and have a one-to-one relationship with the managing device driver on each node;

"dev_global," for designating devices that can be accessed by such device drivers from any node; and

"dev_nodebound," for designating devices that are accessed by a driver on a particular node and have a one-to-one relationship with the device driver.
6. A method configured to provide global access to physical devices located on a computer cluster comprising a 

£5 plurality of nodes, the method comprising the steps of: 

a global file system responding to an access request to access one such physical device issued from one of 
the nodes by requesting a DSO handle from a device configuration system (DCS); 



4. 



5. 
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the DCS determining in respons to th access request an identity of a first device server object (DSO) i as- 
sociated with th requested physical device and returning to the global file system a referenc to the first DSO; 
the global file system returning a file descriptor for subsequent use in accessing the requested physical d vice. 
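The three steps of claim 6 can be sketched as a lookup chain: the file system resolves a logical name to a dev_t, asks the DCS for the responsible DSO, learns where the device lives, and hands back a descriptor. This is a minimal single-process illustration, not the patented implementation: the classes `DeviceServerObject`, `DeviceConfigurationSystem` and `GlobalFileSystem`, the logical name `disk5`, and the choice to start descriptors at 3 are all hypothetical, and the remote open on the host node is reduced to recording its (node, local address) pair.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class DeviceServerObject:
    """Hypothetical DSO: records which node hosts each device and its local address."""
    device_class: str
    locations: dict = field(default_factory=dict)  # dev_t -> (node, local address)

    def resolve(self, dev):
        return self.locations[dev]


class DeviceConfigurationSystem:
    """Hypothetical DCS: maps each dev_t value to the DSO responsible for it."""

    def __init__(self):
        self._dso_by_dev = {}

    def register(self, dev, dso):
        self._dso_by_dev[dev] = dso

    def lookup_dso(self, dev):
        # Identify the DSO associated with the device and return a reference to it.
        return self._dso_by_dev[dev]


class GlobalFileSystem:
    """Hypothetical global file system front end seen by applications."""

    def __init__(self, dcs):
        self.dcs = dcs
        self.names = {}        # logical device name -> dev_t
        self.open_files = {}   # file descriptor -> (node, local address)
        self._fds = count(3)   # assumption: descriptors 0-2 are already in use

    def open(self, name):
        dev = self.names[name]
        dso = self.dcs.lookup_dso(dev)       # request the DSO from the DCS
        node, local_addr = dso.resolve(dev)  # where the device actually lives
        fd = next(self._fds)                 # descriptor for subsequent access
        self.open_files[fd] = (node, local_addr)
        return fd


# Usage
dso = DeviceServerObject("dev_enumerate")
dso.locations[8388613] = ("node2", 7)
dcs = DeviceConfigurationSystem()
dcs.register(8388613, dso)
fs = GlobalFileSystem(dcs)
fs.names["disk5"] = 8388613
fd = fs.open("disk5")
print(fd, fs.open_files[fd])  # 3 ('node2', 7)
```

The application only ever holds the descriptor; which node actually hosts the device stays hidden behind the DCS and DSO lookups.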

7. The method of claim 6, further comprising the steps of:

each of a plurality of device drivers, when a new device of an appropriate type is attached to a respective node, issuing an attach message to a co-located device driver interface (DDI) indicating a local identifier of the new device being attached, each of the device drivers being configured to manage one type of physical device and being associated with a unique major number;

the DDI, in response to the attach message, issuing a map request to the DCS for a unique, global minor (gmin) number for the new device, the map request indicating the major number and the local identifier of the new device;

the DCS, in response to the map request: (a) determining the gmin number and (b) returning the gmin number to the DDI; and

the DDI associating the gmin number returned by the DCS and the major number with the new device so that the new device is accessible from the file system in response to a request to open the new device.
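The attach and map-request exchange of claim 7 can be sketched as a per-node DDI that forwards (node, major number, local identifier) to a single DCS, which hands out gmin numbers that are unique for each major number across the cluster. This is a minimal sketch under assumptions: the classes `DCS` and `DDI`, the per-major counter, and the rule that re-attaching the same device returns its existing gmin are all hypothetical choices, not details taken from the claims.

```python
class DCS:
    """Hypothetical device configuration system that hands out gmin numbers."""

    def __init__(self):
        self._gmin = {}   # (node, major, local id) -> gmin
        self._next = {}   # major -> next unused gmin for that driver

    def map_request(self, node, major, local_id):
        key = (node, major, local_id)
        if key not in self._gmin:
            # (a) determine a gmin not yet used with this major number on any node
            self._gmin[key] = self._next.get(major, 0)
            self._next[major] = self._gmin[key] + 1
        return self._gmin[key]  # (b) return it to the requesting DDI


class DDI:
    """Hypothetical per-node device driver interface."""

    def __init__(self, node, dcs):
        self.node = node
        self.dcs = dcs
        self.minor_nodes = {}  # (major, local id) -> (major, gmin)

    def attach(self, major, local_id):
        # Called when a driver reports a newly attached device.
        gmin = self.dcs.map_request(self.node, major, local_id)
        self.minor_nodes[(major, local_id)] = (major, gmin)
        return major, gmin


# Usage
dcs = DCS()
ddi_a, ddi_b = DDI("node1", dcs), DDI("node2", dcs)
print(ddi_a.attach(32, 0))  # (32, 0)
print(ddi_b.attach(32, 0))  # (32, 1): same driver and local id, different node
print(ddi_a.attach(32, 0))  # (32, 0): re-attaching returns the existing gmin
```

Keeping the counter in one place is what makes the minor number global: two nodes can report the same local identifier for the same driver and still receive distinct gmin numbers.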

8. The method of claim 7, further comprising the steps of: 

the device driver issuing device configuration information to the DDI, including class information, if available, for the new device; and

the DDI including the class information, if available, in the map request. 

9. The method of claim 8, further comprising the steps of: 

upon receiving the map request the DCS consulting a local DSO associated with devices whose class is the 
same as that of the new device; and 

the local DSO determining the gmin number to be assigned to the new device. 
10. The method of claim 9, further comprising the step of:

when the class information is not provided by the device driver, accessing the new device as if the new device were of a dev_enumerate class including devices with at least one occurrence managed by a particular driver, each of the occurrences managed by the particular driver on a particular node being individually enumerated.
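Claims 8 to 10 route the map request through a DSO chosen by device class, with dev_enumerate as the fallback when the driver supplies no class. The sketch below illustrates that dispatch under loudly stated assumptions: `DevClass`, `ClassDSO` and `ClassAwareDCS` are hypothetical names; the class DSOs share one counter so that gmin numbers never collide across classes; and the rule that a dev_nodespecific device receives one gmin shared by all nodes is an illustrative reading of "accessed locally on each node", not a policy stated in the claims.

```python
from enum import Enum
from itertools import count


class DevClass(Enum):
    ENUMERATE = "dev_enumerate"
    NODESPECIFIC = "dev_nodespecific"
    GLOBAL = "dev_global"
    NODEBOUND = "dev_nodebound"


class ClassDSO:
    """Hypothetical DSO that decides gmin numbers for one device class."""

    def __init__(self, device_class, counter):
        self.device_class = device_class
        self._counter = counter  # shared by all class DSOs, so gmins never collide
        self._assigned = {}

    def gmin_for(self, node, major, local_id):
        # Assumption: a node-specific device is opened locally on every node,
        # so the node is left out of the key and all nodes share one gmin.
        if self.device_class is DevClass.NODESPECIFIC:
            key = (major, local_id)
        else:
            key = (node, major, local_id)
        if key not in self._assigned:
            self._assigned[key] = next(self._counter)
        return self._assigned[key]


class ClassAwareDCS:
    """Hypothetical DCS that consults the DSO for the device's class."""

    def __init__(self):
        shared = count()
        self.dsos = {c: ClassDSO(c, shared) for c in DevClass}

    def map_request(self, node, major, local_id, device_class=None):
        # Missing class information is treated as dev_enumerate (claim 10).
        cls = device_class if device_class is not None else DevClass.ENUMERATE
        return self.dsos[cls].gmin_for(node, major, local_id)


# Usage
dcs = ClassAwareDCS()
print(dcs.map_request("node1", 13, 0))                         # 0: no class given
print(dcs.map_request("node2", 13, 0))                         # 1: separate occurrence
print(dcs.map_request("node1", 20, 0, DevClass.NODESPECIFIC))  # 2
print(dcs.map_request("node2", 20, 0, DevClass.NODESPECIFIC))  # 2: shared across nodes
```

Putting the policy in per-class objects keeps the DCS itself generic: adding or changing a class only means registering a different DSO.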




[Drawing sheets: the figures did not survive text extraction. Legible label fragments on one sheet include "class", "minor_number", "local" and "global", with reference numeral 308.]



European Patent 
Office 



EP 0 889 400 A1 
EUROPEAN SEARCH REPORT 



Application Number 

EP 98 30 5133 



DOCUMENTS CONSIDERED TO BE RELEVANT

Category X (relevant to claims 1,3,6,8); Category Y (relevant to claims 2,7):
EP 0 780 778 A (SUN MICROSYSTEMS INC), 25 June 1997
* abstract; claims 1,2,6; figures 2,3,4,4A,4B *

Category Y (relevant to claims 2,7):
"MULTIPLE INSTANCES OF A DEVICE DRIVER IN THE KERNEL", IBM TECHNICAL DISCLOSURE BULLETIN, vol. 32, no. 10B, 1 March 1990, page 382/383, XP000097923
* the whole document *

Category A (relevant to claims 1,6):
WELCH B: "A COMPARISON OF THREE DISTRIBUTED FILE SYSTEM ARCHITECTURES: VNODE, SPRITE, AND PLAN 9", COMPUTING SYSTEMS, vol. 7, no. 2, 1 January 1994, pages 175-199, XP000577569
* abstract *
* page 180, line 20 - page 181, line 30 *
* page 190, line 16 - page 192, line 5 *

CLASSIFICATION OF THE APPLICATION (Int.Cl.6): G06F9/445, G06F17/30

TECHNICAL FIELDS SEARCHED (Int.Cl.6): G06F




The present search report has been drawn up for all claims




THE HAGUE, 14 October 1998 (examiner: Kingma, Y)


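CATEGORY OF CITED DOCUMENTS
X : particularly relevant if taken alone
Y : particularly relevant if combined with another document of the same category
A : technological background
O : non-written disclosure
P : intermediate document
T : theory or principle underlying the invention
E : earlier patent document, but published on, or after the filing date
D : document cited in the application
L : document cited for other reasons
& : member of the same patent family, corresponding document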






